## Overview

In this report I detail the machine learning (ML) models I implemented to predict housing prices in Boston suburbs. The data set for this experiment comes from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Housing). The report walks through the entire process: getting and cleaning the data, exploratory analysis of the distributions and the influence of each feature, forming a hypothesis, training ML models, and evaluating them.
The data set consists of 506 observations of 14 attributes. The median value of owner-occupied homes in $1000s, denoted MEDV, is the outcome (dependent) variable in our models. Below is a brief description of each predictor feature and the outcome:

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if tract bounds the river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT: percentage of lower-status population
- MEDV: median value of owner-occupied homes in $1000s (the outcome)
The first step is getting the data into R, cleaning it, and transforming it into a neat, usable R data frame. The Boston housing file contains the raw data, with each observation spread across two consecutive lines. The R function readLines() reads the file line by line; below, I use subsetting and the strsplit() function to extract the values and then attach the column names.
# READ THE DOWNLOADED DATA
text <- readLines("boston.txt")
text <- text[c(-1, -2)]  # drop the two header lines
df <- NULL
i <- 1
while (i <= 1012) {  # each observation spans two consecutive lines
    j <- i + 1
    texti <- na.omit(as.numeric(strsplit(text[i], " ")[[1]]))
    textj <- na.omit(as.numeric(strsplit(text[j], " ")[[1]]))
    df <- rbind(df, c(texti, textj))
    i <- j + 1
}
colnames(df) <- c("CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE", "DIS",
    "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV")
rownames(df) <- NULL
df <- as.data.frame(df)
# first 10 observations
head(formattable(df), 10)
| CRIM | ZN | INDUS | CHAS | NOX | RM | AGE | DIS | RAD | TAX | PTRATIO | B | LSTAT | MEDV |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.00632 | 18.0 | 2.31 | 0 | 0.538 | 6.575 | 65.2 | 4.0900 | 1 | 296 | 15.3 | 396.90 | 4.98 | 24.0 |
| 0.02731 | 0.0 | 7.07 | 0 | 0.469 | 6.421 | 78.9 | 4.9671 | 2 | 242 | 17.8 | 396.90 | 9.14 | 21.6 |
| 0.02729 | 0.0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 392.83 | 4.03 | 34.7 |
| 0.03237 | 0.0 | 2.18 | 0 | 0.458 | 6.998 | 45.8 | 6.0622 | 3 | 222 | 18.7 | 394.63 | 2.94 | 33.4 |
| 0.06905 | 0.0 | 2.18 | 0 | 0.458 | 7.147 | 54.2 | 6.0622 | 3 | 222 | 18.7 | 396.90 | 5.33 | 36.2 |
| 0.02985 | 0.0 | 2.18 | 0 | 0.458 | 6.430 | 58.7 | 6.0622 | 3 | 222 | 18.7 | 394.12 | 5.21 | 28.7 |
| 0.08829 | 12.5 | 7.87 | 0 | 0.524 | 6.012 | 66.6 | 5.5605 | 5 | 311 | 15.2 | 395.60 | 12.43 | 22.9 |
| 0.14455 | 12.5 | 7.87 | 0 | 0.524 | 6.172 | 96.1 | 5.9505 | 5 | 311 | 15.2 | 396.90 | 19.15 | 27.1 |
| 0.21124 | 12.5 | 7.87 | 0 | 0.524 | 5.631 | 100.0 | 6.0821 | 5 | 311 | 15.2 | 386.63 | 29.93 | 16.5 |
| 0.17004 | 12.5 | 7.87 | 0 | 0.524 | 6.004 | 85.9 | 6.5921 | 5 | 311 | 15.2 | 386.71 | 17.10 | 18.9 |
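As a sanity check on the parsing loop, the same data set also ships with the MASS package (as MASS::Boston, with lower-case column names and the B column renamed to black), so the dimensions of our hand-built frame can be verified against it. This assumes MASS, a recommended package bundled with standard R installations, is available:

```r
# Cross-check: MASS carries the same Boston housing data set
library(MASS)
data(Boston)
dim(Boston)  # 506 rows x 14 columns, matching our parsed df
```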
Now, let us check and explore the cleaned data frame containing the housing data. The df object is of class ‘data.frame’, which is easy to work with in R scripts. The str() function compactly displays the structure of a data frame: the number of observations, the number of variables, each column’s name and class, and sample values from each column.
## DISPLAY SUMMARY STATISTICS
summary(df)
## CRIM ZN INDUS CHAS
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## NOX RM AGE DIS
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## RAD TAX PTRATIO B
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## LSTAT MEDV
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
str(df)
## 'data.frame': 506 obs. of 14 variables:
## $ CRIM : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ ZN : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ INDUS : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ CHAS : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NOX : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ RM : num 6.58 6.42 7.18 7 7.15 ...
## $ AGE : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ DIS : num 4.09 4.97 4.97 6.06 6.06 ...
## $ RAD : num 1 2 2 3 3 3 5 5 5 5 ...
## $ TAX : num 296 242 242 222 222 222 311 311 311 311 ...
## $ PTRATIO: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ B : num 397 397 393 395 397 ...
## $ LSTAT : num 4.98 9.14 4.03 2.94 5.33 ...
## $ MEDV : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
Let us visualize the distribution and density of the outcome, MEDV. The black curve represents the density, and a boxplot is also plotted for an additional perspective. We see that the median house price is skewed to the right, with a number of outliers on the high end. It may therefore be useful to transform the MEDV column, for example with the natural logarithm, when modeling the regression hypothesis.
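The outliers a boxplot flags are, by default, points beyond 1.5 times the interquartile range from the quartiles. That rule can be sketched directly in base R; the vector below is made up purely for illustration:

```r
# Tukey's boxplot rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
iqr_outliers <- function(x) {
    q <- quantile(x, c(0.25, 0.75))
    fence <- 1.5 * (q[2] - q[1])
    x[x < q[1] - fence | x > q[2] + fence]
}

iqr_outliers(c(10, 12, 11, 13, 12, 50))  # returns 50, far above the upper fence
```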
p01 <- ggplot(df, aes(x = MEDV)) + xlab("Median value of owner-occupied homes") +
geom_histogram(aes(y = ..density..), binwidth = 1, fill = "green") + geom_density(alpha = 0.8,
fill = "red") + ylim(0, 0.075)
ggplotly(p01)
p02 <- ggplot(df, aes(y = MEDV, x = RM)) + xlab("average number of rooms per dwelling") +
ylab("Median value of owner-occupied homes") + geom_point(colour = "red") +
geom_smooth(method = lm, colour = "blue") + geom_smooth(colour = "orange")
ggplotly(p02)
p03 <- ggplot(df, aes(y = MEDV, x = LSTAT)) + xlab("%lower status of the population") +
ylab("Median value of owner-occupied homes") + geom_point(colour = "green") +
geom_smooth(method = lm, colour = "yellow") + geom_smooth(colour = "red")
ggplotly(p03)
p04 <- ggplot(df, aes(y = MEDV, x = PTRATIO)) + xlab("pupil-teacher ratio by town") +
ylab("Median value of owner-occupied homes") + geom_point(colour = "black") +
geom_smooth(method = lm, colour = "yellow") + geom_smooth(colour = "green")
ggplotly(p04)
b0x <- ggplot(df, aes(x = "", y = MEDV)) + geom_boxplot(alpha = 0.8, colour = "black",
    fill = "green") + coord_flip()
b0x
Now, let us draw scatter plots of some of the intuitively important variables against the outcome MEDV. We see strong positive or negative correlations between these variables and the outcome. It is also evident that INDUS and NOX are strongly positively correlated with one another, since nitric oxide levels tend to rise as industry increases.
plot(df[,c(3,5,6,11,13,14)],pch = 3)
Correlation and near-zero variance. A few important properties to check now are the correlation of each input feature with the dependent variable, and whether any feature has near-zero variance (values that barely vary within the column).
suppressMessages(library(caret))
cor(df, df$MEDV)
## [,1]
## CRIM -0.3883046
## ZN 0.3604453
## INDUS -0.4837252
## CHAS 0.1752602
## NOX -0.4273208
## RM 0.6953599
## AGE -0.3769546
## DIS 0.2499287
## RAD -0.3816262
## TAX -0.4685359
## PTRATIO -0.5077867
## B 0.3334608
## LSTAT -0.7376627
## MEDV 1.0000000
We see that the number of rooms RM has the strongest positive correlation with the median value of the housing price, while the percentage of lower status population, LSTAT and the pupil-teacher ratio, PTRATIO, have strong negative correlation. The feature with the least correlation to MEDV is the proximity to Charles River, CHAS.
nzv <- nearZeroVar(df, saveMetrics = TRUE)
sum(nzv$nzv)
## [1] 0
The output shows that there are no variables with zero or near-zero variance.
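For intuition, caret's nearZeroVar() flags a column when its most frequent value heavily dominates the second most frequent and the column has few distinct values. A rough base-R sketch of that check, assuming caret's default cutoffs (freqCut = 95/5, uniqueCut = 10):

```r
# Rough analogue of caret::nearZeroVar(): flag a column when the
# frequency ratio of its two most common values exceeds 95/5 AND
# fewer than 10% of its values are distinct.
nzv_check <- function(x, freq_cut = 95/5, unique_cut = 10) {
    tab <- sort(table(x), decreasing = TRUE)
    freq_ratio <- if (length(tab) > 1) unname(tab[1]/tab[2]) else Inf
    pct_unique <- 100 * length(unique(x)) / length(x)
    freq_ratio > freq_cut && pct_unique < unique_cut
}

nzv_check(rep(1, 100))  # TRUE: essentially constant column
nzv_check(1:100)        # FALSE: every value distinct
```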
Next, we center and scale the input features, then partition the data in a 70/30 ratio into training and test sets.
# centering and scaling of input features
df <- cbind(scale(df[1:13]), df[14])
set.seed(12345)
inTrain <- createDataPartition(y = df$MEDV, p = 0.70, list = FALSE)
training <- df[inTrain,]
testing <- df[-inTrain,]
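For readers without caret, an equivalent preprocessing and split can be sketched in base R (createDataPartition additionally stratifies on the outcome; plain sample() below does not). The "toy" frame here is a hypothetical stand-in, since the report itself works on df:

```r
# Base-R sketch of centering/scaling and a 70/30 train/test split
toy <- data.frame(x1 = rnorm(100), x2 = runif(100), y = rnorm(100))
toy[1:2] <- scale(toy[1:2])  # center each feature to mean 0, sd 1

set.seed(12345)
idx <- sample(seq_len(nrow(toy)), size = round(0.7 * nrow(toy)))
train <- toy[idx, ]
test <- toy[-idx, ]
c(train = nrow(train), test = nrow(test))  # 70 and 30 rows
```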
Next, I develop the models and the machinery needed to make predictions. Being able to evaluate each model's performance accurately on held-out data greatly reinforces confidence in the predictions.
First, let us try a linear regression model with MEDV as the dependent variable and all remaining variables as independent variables, trained on the training data set. Below are the fitted coefficients of all the features, plus the intercept. Next, we use the trained model to predict the outcome (MEDV) for the test set. A good metric for the accuracy of the model is the root-mean-squared error, given by:
\[\sqrt{\sum_{i=1}^{n} \frac{(y_{pred_i} - y_{act_i})^2} {n}}\]
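The formula above can be wrapped in a small helper so every model below is scored the same way; this is a convenience function written here, not part of any package:

```r
# Root-mean-squared error between predicted and actual values
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), approximately 1.1547
```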
## LINEAR REGRESSION MODEL 1 - using all features
set.seed(12345)
fit.lm <- lm(MEDV ~ ., data = training)
fit.lm
##
## Call:
## lm(formula = MEDV ~ ., data = training)
##
## Coefficients:
## (Intercept) CRIM ZN INDUS CHAS
## 22.7369 -0.9232 0.9084 0.1753 0.8109
## NOX RM AGE DIS RAD
## -2.6324 2.5781 0.3322 -3.2202 2.2919
## TAX PTRATIO B LSTAT
## -1.8564 -2.2865 0.6725 -3.6271
autoplot(fit.lm, which = 1:6, ncol = 3, label.size = 3)
#### CHECK COEFFICIENTS
data.frame(coef = round(fit.lm$coefficients, 2))
## coef
## (Intercept) 22.74
## CRIM -0.92
## ZN 0.91
## INDUS 0.18
## CHAS 0.81
## NOX -2.63
## RM 2.58
## AGE 0.33
## DIS -3.22
## RAD 2.29
## TAX -1.86
## PTRATIO -2.29
## B 0.67
## LSTAT -3.63
summary(fit.lm)
##
## Call:
## lm(formula = MEDV ~ ., data = training)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.911 -2.737 -0.724 1.839 26.275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.7369 0.2529 89.899 < 2e-16 ***
## CRIM -0.9232 0.3071 -3.006 0.002839 **
## ZN 0.9084 0.3875 2.344 0.019647 *
## INDUS 0.1753 0.5193 0.338 0.735852
## CHAS 0.8109 0.2586 3.135 0.001866 **
## NOX -2.6324 0.5379 -4.894 1.53e-06 ***
## RM 2.5781 0.3520 7.324 1.74e-12 ***
## AGE 0.3322 0.4511 0.736 0.462006
## DIS -3.2202 0.5051 -6.376 5.90e-10 ***
## RAD 2.2919 0.6855 3.343 0.000919 ***
## TAX -1.8564 0.7681 -2.417 0.016177 *
## PTRATIO -2.2865 0.3312 -6.903 2.48e-11 ***
## B 0.6725 0.3033 2.217 0.027255 *
## LSTAT -3.6271 0.4349 -8.339 1.85e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.753 on 342 degrees of freedom
## Multiple R-squared: 0.7491, Adjusted R-squared: 0.7396
## F-statistic: 78.54 on 13 and 342 DF, p-value: < 2.2e-16
ggplot(data = df, aes(NOX, MEDV)) + geom_smooth()
set.seed(12345)
# predict on test set
pred.lm <- predict(fit.lm, newdata = testing)
# Root-mean squared error
rmse.lm <- sqrt(sum((pred.lm - testing$MEDV)^2)/length(testing$MEDV))
c(RMSE = rmse.lm, R2 = summary(fit.lm)$r.squared, P_value = summary(fit.lm)$coefficients[1,
4])
## RMSE R2 P_value
## 4.826464e+00 7.490908e-01 5.003731e-240
We see that the RMSE is 4.826 and the R2 value is 0.749 for this model.
We also saw that the outcome variable MEDV is skewed to the right; a log transformation brings its distribution closer to normal. Let us fit a glm with log(MEDV) as the outcome and a subset of the features as inputs. As shown below, the RMSE is lower for this model.
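The effect of the log transform on a right-skewed variable can be illustrated on simulated data: the sample skewness drops toward zero after taking logs. Skewness is computed by hand here to avoid pulling in extra packages:

```r
# Sample skewness: third central moment over the cubed standard deviation
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(12345)
x <- exp(rnorm(1000))  # lognormal draws: heavily right-skewed
c(raw = skewness(x), logged = skewness(log(x)))  # logged value is near 0
```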
set.seed(12345)
fit.lm1 <- glm(log(MEDV) ~ CRIM + CHAS + NOX + RM + DIS + PTRATIO + RAD + B +
LSTAT, data = training)
# predict on test set
summary(fit.lm1)
##
## Call:
## glm(formula = log(MEDV) ~ CRIM + CHAS + NOX + RM + DIS + PTRATIO +
## RAD + B + LSTAT, data = training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.72099 -0.09896 -0.01041 0.09383 0.82814
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.04204 0.01042 291.842 < 2e-16 ***
## CRIM -0.09118 0.01260 -7.233 3.05e-12 ***
## CHAS 0.03391 0.01057 3.208 0.00146 **
## NOX -0.11597 0.01988 -5.833 1.25e-08 ***
## RM 0.06886 0.01395 4.937 1.24e-06 ***
## DIS -0.10563 0.01698 -6.220 1.43e-09 ***
## PTRATIO -0.09544 0.01274 -7.492 5.71e-13 ***
## RAD 0.04143 0.01745 2.375 0.01811 *
## B 0.03312 0.01247 2.655 0.00830 **
## LSTAT -0.19588 0.01680 -11.657 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.03839931)
##
## Null deviance: 62.996 on 355 degrees of freedom
## Residual deviance: 13.286 on 346 degrees of freedom
## AIC: -138.32
##
## Number of Fisher Scoring iterations: 2
# predict on test set
pred.lm1 <- predict(fit.lm1, newdata = testing)
pred.lm1
## 1 2 6 11 12 20 22 25
## 3.424905 3.215083 3.243884 2.901499 3.071125 2.936919 2.878708 2.789241
## 26 28 30 36 39 42 47 62
## 2.714936 2.737635 2.981513 3.155189 3.119216 3.345558 3.028396 2.884679
## 70 72 73 80 81 83 85 86
## 3.115354 3.128846 3.264412 3.191590 3.328428 3.247598 3.201864 3.325643
## 87 92 94 95 98 100 102 105
## 3.093660 3.311999 3.336809 3.223000 3.599877 3.482120 3.221758 3.054268
## 112 114 118 120 121 126 127 129
## 3.270493 3.025031 3.194205 3.073372 2.980889 2.990914 2.613032 2.904226
## 137 138 144 148 152 154 156 157
## 2.800400 2.925245 2.531525 2.407082 2.884173 2.807819 2.921925 2.683686
## 161 165 169 170 179 180 186 188
## 3.571201 3.209312 3.249681 3.254000 3.445617 3.475009 3.145311 3.536334
## 198 199 201 204 205 206 219 220
## 3.458007 3.542799 3.380556 3.691603 3.734202 3.120015 3.159950 3.379877
## 221 223 227 229 230 236 239 240
## 3.493257 3.456289 3.630410 3.563359 3.456111 3.198129 3.341913 3.327938
## 242 246 247 254 255 259 265 269
## 3.136338 2.729533 3.027688 3.352089 3.173883 3.531687 3.511827 3.698986
## 270 271 281 284 285 291 294 299
## 3.200219 3.049443 3.678104 3.861841 3.372352 3.415830 3.295066 3.383298
## 302 303 311 312 313 314 316 317
## 3.329713 3.338158 2.963462 3.289098 3.117618 3.223568 3.036883 2.864571
## 324 328 335 336 343 344 345 349
## 2.992317 2.967962 3.054559 3.019671 3.202453 3.265521 3.313382 3.229837
## 352 354 359 366 371 372 378 379
## 3.078071 3.117479 3.010418 2.844805 3.531225 3.102840 2.793666 2.537935
## 380 383 384 389 393 397 398 408
## 2.635077 2.572205 2.559339 2.251644 2.431292 2.810711 2.712854 2.896913
## 409 411 423 425 429 430 431 432
## 2.601788 2.487682 2.842687 2.705089 2.607256 2.519765 2.801332 2.773441
## 433 434 439 442 448 450 457 458
## 2.982320 2.743538 2.106845 2.695378 2.756805 2.716869 2.585770 2.588698
## 466 467 468 472 473 481 485 486
## 2.865484 2.691547 2.775159 3.080388 3.026308 3.105431 2.958702 3.056963
## 491 497 500 501 502 506
## 2.416903 2.732105 2.936100 2.997094 3.090174 3.084712
range(pred.lm1)
## [1] 2.106845 3.861841
# Root-mean squared error (back-transforming predictions with exp)
rmse.lm1 <- sqrt(sum((exp(pred.lm1) - testing$MEDV)^2)/length(testing$MEDV))
# glm fits carry no r.squared; use the deviance-based analogue instead
r2.lm1 <- 1 - fit.lm1$deviance/fit.lm1$null.deviance
c(RMSE = rmse.lm1, R2 = r2.lm1)
## RMSE R2
## 4.612483 0.789098
We see that the RMSE is 4.612 for this model, an improvement over the 4.826 of the first linear model.
Let us examine the calculated p-value for each feature in the full linear model. Any feature that is not significant (p > 0.05) contributes little to the model, probably due to multicollinearity with other features. We saw above that ZN, INDUS, and AGE were not significant.
vif(fit.lm)
Variance inflation factors, computed with vif(), quantify how much multicollinearity inflates the standard errors of the linear model's coefficient estimates. Ideally the VIF should be below 5 for every feature; in the full model it exceeds 5 for RAD and TAX. Dropping the insignificant and collinear features yields the reduced formula used for the second model:
log(MEDV) ~ CRIM + CHAS + NOX + RM + DIS + PTRATIO + RAD + B + LSTAT
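For intuition about what vif() computes: the VIF of feature j is 1/(1 - R2_j), where R2_j comes from regressing feature j on the remaining features. A minimal base-R sketch on hypothetical toy data, where one column is built to be nearly collinear with the others:

```r
# Manual variance inflation factors: VIF_j = 1 / (1 - R^2_j),
# where R^2_j is from regressing predictor j on all other predictors.
vif_manual <- function(X) {
    sapply(seq_len(ncol(X)), function(j) {
        r2 <- summary(lm(X[, j] ~ ., data = X[, -j, drop = FALSE]))$r.squared
        1 / (1 - r2)
    })
}

set.seed(1)
X <- data.frame(a = rnorm(100), b = rnorm(100))
X$c <- X$a + X$b + rnorm(100, sd = 0.1)  # c nearly collinear with a + b
round(vif_manual(X), 1)  # all three columns get badly inflated VIFs
```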
Based on its lower RMSE, this reduced log-transformed model is more accurate than the full linear model. Let us plot the predicted versus actual values of the outcome MEDV.
plot(pred.lm1, testing$MEDV, xlab = "Predicted log(Price)", ylab = "Actual Price")
# diagnostics plots
layout(matrix(c(1, 2, 3, 4), 2, 2))
plot(fit.lm1)
# LINEAR MODEL 2 - Table showing Actual vs Predicted Price (predictions on the log scale)
table <- data.frame(x = pred.lm1, y = testing$MEDV)
names(table) <- c("Predicted_log_Price", "Actual_Price")
data.table(table)
## Predicted_log_Price Actual_Price
## 1: 3.424905 24.0
## 2: 3.215083 21.6
## 3: 3.243884 28.7
## 4: 2.901499 15.0
## 5: 3.071125 18.9
## ---
## 146: 2.732105 19.7
## 147: 2.936100 17.5
## 148: 2.997094 16.8
## 149: 3.090174 22.4
## 150: 3.084712 11.9
For the random forest implementation, we can reuse the model formula MEDV ~ . (MEDV as the outcome, with all other features as inputs). Since random forests capture non-linear relationships and interactions, better performance is expected; inspecting the results below, the random forest model indeed gives the best accuracy so far.
library(randomForest)
set.seed(12345)
fit.rf <- randomForest(formula = MEDV ~ ., data = training)
fit.rf
##
## Call:
## randomForest(formula = MEDV ~ ., data = training)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 4
##
## Mean of squared residuals: 11.93118
## % Var explained: 86.2
pred.rf <- predict(fit.rf, testing)
rmse.rf <- sqrt(sum(((pred.rf) - testing$MEDV)^2)/length(testing$MEDV))
table <- c(RMSE = rmse.rf, Rsquared = mean(fit.rf$rsq))
print(table)
## RMSE Rsquared
## 3.365975 0.855014
plot(pred.rf, testing$MEDV, xlab = "Predicted Price", ylab = "Actual Price",
pch = 3)
Table showing the first observations - Actual vs Predicted Price
table1 <- data.frame(x = pred.rf, y = testing$MEDV)
names(table1) <- c("Predicted_Price", "Actual_Price")
head(kable(table1), 10)
## [1] " Predicted_Price Actual_Price"
## [2] "---- ---------------- -------------"
## [3] "1 29.033943 24.0"
## [4] "2 23.418307 21.6"
## [5] "6 27.781137 28.7"
## [6] "11 21.165063 15.0"
## [7] "12 21.218273 18.9"
## [8] "20 19.807877 18.2"
## [9] "22 18.389208 19.6"
## [10] "25 17.227528 15.6"
Linear_Model_1 <- c(RMSE = rmse.lm, R2 = summary(fit.lm)$r.squared)
Linear_Model_2 <- c(RMSE = rmse.lm1, R2 = 1 - fit.lm1$deviance/fit.lm1$null.deviance)
Random_Forest_Model <- c(RMSE = rmse.rf, R2 = mean(fit.rf$rsq))
model_comparison <- rbind(Linear_Model_1, Linear_Model_2, Random_Forest_Model)
kable(model_comparison)
| | RMSE | R2 |
|---|---|---|
| Linear_Model_1 | 4.826464 | 0.7490908 |
| Linear_Model_2 | 4.612483 | 0.7890980 |
| Random_Forest_Model | 3.365975 | 0.8550140 |
We experimented with two linear regression models and a random forest model to predict housing prices in Boston suburbs. Among these, the random forest model, trained with the same MEDV ~ . formula relating the outcome to all input features, yielded the best predictions, as determined by the smallest RMSE (root-mean-squared error) and the highest R2 (R-squared statistic).